Data Science Hackathon¶
Problem Statement (Marketing for Financial Services)¶
- DB Bank is a large public sector bank which has branches across all the cities. It provides various services like savings accounts, current accounts, term deposits, personal loans, home loans, etc. to customers.
- Recently, the bank conducted a campaign to market its term-deposit scheme. The campaign was conducted mostly through direct phone calls, soliciting the bank's customers to place a term deposit. If, after all the marketing efforts, a client agreed to place a deposit, the campaign was a success for that client; otherwise it was not.
- As data analysts, we are required to analyse the data at hand and perform exploratory and descriptive analysis to identify useful patterns, trends, and insights that will help the marketing team target the right customers. We are also required to build a predictive model on the given data that can correctly classify whether a client will agree to place a deposit or not.
Step 1: Data Understanding, Cleaning and Missing Value Treatment¶
Importing Necessary Libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from IPython.display import display
pd.options.plotting.backend = 'plotly'
import warnings
warnings.filterwarnings('ignore')
Data Reading and Descriptive Analysis¶
city_master_df = pd.read_csv('Data/City_Master.csv')
state_master_df = pd.read_csv('Data/State_Master.csv')
region_code_master_df = pd.read_csv('Data/Region_code_master.csv')
customer_bank_df = pd.read_csv('Data/Customer_and_bank details_p1.csv')
customer_campaign_df = pd.read_csv('Data/Customer_campaign_details_p1.csv')
customer_address_df = pd.read_csv('Data/Customer_Postal_Code_details.csv')
customer_demographics_df = pd.read_csv('Data/Customer_social_economic_data_p1.csv')
customer_response_df = pd.read_csv('Data/Customer_Response_data_p1.csv')
display(city_master_df.head())
display(state_master_df.head())
display(region_code_master_df.head())
display(customer_bank_df.head())
display(customer_campaign_df.head())
display(customer_address_df.head())
display(customer_demographics_df.head())
display(customer_response_df.head())
| | City_Code | City_Name | State_Code |
|---|---|---|---|
| 0 | C1 | Henderson | S1 |
| 1 | C2 | Los Angeles | S2 |
| 2 | C3 | Fort Lauderdale | S3 |
| 3 | C4 | Concord | S4 |
| 4 | C5 | Seattle | S5 |
| | State_Code | State_Name | Region_Code |
|---|---|---|---|
| 0 | S1 | Kentucky | 3 |
| 1 | S2 | California | 4 |
| 2 | S3 | Florida | 3 |
| 3 | S4 | North Carolina | 3 |
| 4 | S5 | Washington | 4 |
| | Region_Name | Region_Code |
|---|---|---|
| 0 | Central | 1 |
| 1 | East | 2 |
| 2 | South | 3 |
| 3 | West | 4 |
| 4 | North | 5 |
| | Customer_id | age | job | marital | education | default | housing | loan | Region_Code | State_Code | City_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | services | married | high.school | no | no | yes | 3 | S1 | C1 |
| 1 | 2 | 45 | services | married | basic.9y | unknown | no | no | 3 | S1 | C1 |
| 2 | 3 | 59 | admin. | married | professional.course | no | no | no | 4 | S2 | C2 |
| 3 | 4 | 41 | blue-collar | married | unknown | unknown | no | no | 3 | S3 | C3 |
| 4 | 5 | 24 | technician | single | professional.course | no | yes | no | 3 | S3 | C3 |
| | Customer_id | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent |
| 1 | 2 | telephone | may | mon | 198 | 1 | 999 | 0 | nonexistent |
| 2 | 3 | telephone | may | mon | 139 | 1 | 999 | 0 | nonexistent |
| 3 | 4 | telephone | may | mon | 217 | 1 | 999 | 0 | nonexistent |
| 4 | 5 | telephone | may | mon | 380 | 1 | 999 | 0 | nonexistent |
| | customer_id | Postal Code |
|---|---|---|
| 0 | 1 | 42420 |
| 1 | 2 | 42420 |
| 2 | 3 | 90036 |
| 3 | 4 | 33311 |
| 4 | 5 | 33311 |
| | Customer_id | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|
| 0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 1 | 2 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 2 | 3 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 3 | 4 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 4 | 5 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| | Customer_id | y |
|---|---|---|
| 0 | 1 | no |
| 1 | 2 | no |
| 2 | 3 | no |
| 3 | 4 | no |
| 4 | 5 | no |
Action Insights
After having a look at the basic structure of all the datasets, let us merge customer_bank_df, customer_campaign_df, customer_address_df, customer_demographics_df and customer_response_df on the Customer_id column and then deep dive into the descriptive statistics of the merged data.
df = pd.merge(left = customer_bank_df, right = customer_campaign_df, on = 'Customer_id', how = 'left')
df = pd.merge(left = df, right = customer_demographics_df, on = 'Customer_id', how = 'left')
customer_address_df.rename(columns = {'customer_id':'Customer_id'}, inplace=True)
df = pd.merge(df, customer_address_df, on = 'Customer_id', how='left')
df = pd.merge(left = df, right = customer_response_df, on = 'Customer_id', how = 'left')
df.head()
| | Customer_id | age | job | marital | education | default | housing | loan | Region_Code | State_Code | ... | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | Postal Code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | services | married | high.school | no | no | yes | 3 | S1 | ... | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 1 | 2 | 45 | services | married | basic.9y | unknown | no | no | 3 | S1 | ... | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 2 | 3 | 59 | admin. | married | professional.course | no | no | no | 4 | S2 | ... | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90036 | no |
| 3 | 4 | 41 | blue-collar | married | unknown | unknown | no | no | 3 | S3 | ... | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| 4 | 5 | 24 | technician | single | professional.course | no | yes | no | 3 | S3 | ... | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
5 rows × 26 columns
df.shape
(37084, 26)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37084 entries, 0 to 37083
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Customer_id     37084 non-null  int64
 1   age             37084 non-null  int64
 2   job             37084 non-null  object
 3   marital         37084 non-null  object
 4   education       37084 non-null  object
 5   default         37084 non-null  object
 6   housing         37084 non-null  object
 7   loan            37084 non-null  object
 8   Region_Code     37084 non-null  object
 9   State_Code      37084 non-null  object
 10  City_Code       37084 non-null  object
 11  contact         37084 non-null  object
 12  month           37084 non-null  object
 13  day_of_week     37084 non-null  object
 14  duration        37084 non-null  int64
 15  campaign        37084 non-null  int64
 16  pdays           37084 non-null  int64
 17  previous        37084 non-null  int64
 18  poutcome        37084 non-null  object
 19  emp.var.rate    37084 non-null  float64
 20  cons.price.idx  37084 non-null  float64
 21  cons.conf.idx   37084 non-null  float64
 22  euribor3m       37084 non-null  float64
 23  nr.employed     37084 non-null  float64
 24  Postal Code     37084 non-null  int64
 25  y               37084 non-null  object
dtypes: float64(5), int64(7), object(14)
memory usage: 7.6+ MB
df.duplicated().sum()
0
Observations
- We have 37084 rows and 26 columns in our dataset.
- None of the columns appear to have any NaN values. However, abnormal values in these columns still have to be analysed, as there might be missing values encoded in a format other than np.nan.
- The data type of the Postal Code column should be object, because different postal codes are indicative of different locations and hence categorical. The data types of all the other columns look correct and don't need to be changed.
- We have a total of 26 columns, out of which 14 are categorical and 12 are numeric. Within the numeric columns, 5 are discrete and the remaining 7 are continuous.
- Customer ID is a unique identifier for each customer; hence it should be removed before the final model-building step.
- There are no duplicate entries in the dataset.
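The point about missing values hiding in formats other than np.nan can be checked in one pass. A minimal sketch on toy data (the column names and placeholder values here are illustrative, chosen to mirror the markers we expect in this dataset):

```python
import pandas as pd

# Toy frame mimicking the merged dataset: missing values hide behind
# placeholder strings ('unknown', 'Na') or numeric sentinels (999),
# so df.isna() reports nothing even though information is missing.
toy = pd.DataFrame({
    'education': ['high.school', 'unknown', 'basic.9y'],
    'region_code': ['3', 'Na', '4'],
    'pdays': [999, 6, 999],
})

# Count placeholder values per column in a single pass.
placeholder_counts = toy.isin(['unknown', 'Na', 999]).sum()
print(placeholder_counts)
```

Running the same `isin` scan on the real merged frame would surface every column that needs a closer look before modelling.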
Converting Postal Code from Int64 to Object Type
df['Postal Code'] = df['Postal Code'].astype('object')
Descriptive Statistics of Numeric and Object Types
df.describe(include = ['int64','float64'])
| | Customer_id | age | duration | campaign | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 |
| mean | 18542.500000 | 40.042714 | 258.237946 | 2.569545 | 962.530849 | 0.172986 | 0.082669 | 93.576076 | -40.505183 | 3.621668 | 5167.058664 |
| std | 10705.373028 | 10.432965 | 258.730909 | 2.770611 | 186.773063 | 0.495681 | 1.568997 | 0.578493 | 4.622045 | 1.733972 | 72.196605 |
| min | 1.000000 | 17.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | -3.400000 | 92.201000 | -50.800000 | 0.634000 | 4963.600000 |
| 25% | 9271.750000 | 32.000000 | 102.000000 | 1.000000 | 999.000000 | 0.000000 | -1.800000 | 93.075000 | -42.700000 | 1.344000 | 5099.100000 |
| 50% | 18542.500000 | 38.000000 | 180.000000 | 2.000000 | 999.000000 | 0.000000 | 1.100000 | 93.749000 | -41.800000 | 4.857000 | 5191.000000 |
| 75% | 27813.250000 | 47.000000 | 319.250000 | 3.000000 | 999.000000 | 0.000000 | 1.400000 | 93.994000 | -36.400000 | 4.961000 | 5228.100000 |
| max | 37084.000000 | 98.000000 | 4918.000000 | 56.000000 | 999.000000 | 7.000000 | 1.400000 | 94.767000 | -26.900000 | 5.045000 | 5228.100000 |
df.describe(include = ['object'])
| | job | marital | education | default | housing | loan | Region_Code | State_Code | City_Code | contact | month | day_of_week | poutcome | Postal Code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 |
| unique | 12 | 4 | 8 | 3 | 3 | 3 | 5 | 49 | 531 | 2 | 10 | 5 | 3 | 631 | 2 |
| top | admin. | married | university.degree | no | yes | no | 4 | S2 | C21 | cellular | may | thu | nonexistent | 10035 | no |
| freq | 9420 | 22479 | 10971 | 29382 | 19433 | 30561 | 11884 | 7427 | 3422 | 23522 | 12420 | 7778 | 32023 | 974 | 32876 |
Observations
- The maximum customer age in the dataset is 98 while the 75th percentile is 47, so this might be an abnormal value which needs to be examined later.
- The maximum duration is 4918 while the 75th percentile is only 319.25, which again indicates an outlier that needs to be examined and treated accordingly.
- Most of the values in the pdays column are 999, which indicates that the client wasn't contacted in a previous campaign. A better alternative to such a high placeholder would be -1, as it would keep the distribution of the original data intact.
- Categorical columns need to be explored for any abnormal categories present in them.
- It is a good idea to change all the column names to lowercase for uniformity and to replace any periods and spaces with underscores.
Renaming the columns for uniformity¶
df.rename(columns = {'Customer_id':'customer_id',
'Region_Code':'region_code',
'State_Code':'state_code',
'City_Code':'city_code',
'Postal Code':'postal_code',
'emp.var.rate':'emp_var_rate',
'cons.price.idx':'cons_price_idx',
'cons.conf.idx':'cons_conf_idx',
'nr.employed':'nr_employed'}, inplace=True)
df.columns
Index(['customer_id', 'age', 'job', 'marital', 'education', 'default',
'housing', 'loan', 'region_code', 'state_code', 'city_code', 'contact',
'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
'euribor3m', 'nr_employed', 'postal_code', 'y'],
dtype='object')
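The explicit rename dictionary above can also be expressed programmatically with pandas string methods, which scales if more columns are added later. A sketch using a representative subset of the original column names:

```python
import pandas as pd

# A representative subset of the pre-rename column names.
cols = pd.Index(['Customer_id', 'emp.var.rate', 'cons.price.idx', 'Postal Code'])

# Lowercase everything, then replace periods and spaces with underscores.
normalized = (cols.str.lower()
                  .str.replace('.', '_', regex=False)
                  .str.replace(' ', '_', regex=False))
print(list(normalized))
```

Assigning the result back via `df.columns = normalized` would achieve the same effect as the rename dictionary in one line.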
Replacing 999 in pdays with -1
df['pdays'] = np.where(df['pdays']==999,-1,df['pdays'])
df['pdays'].unique()
array([-1, 6, 4, 3, 5, 1, 0, 10, 7, 8, 9, 11, 2, 12, 13, 14, 15,
16, 21, 17, 18, 22, 25, 26, 19, 27, 20], dtype=int64)
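Why -1 is a better sentinel than 999 can be illustrated on toy data (the values below are made up): the 999 placeholder sits far outside the real range and inflates every summary statistic, while -1 sits just below it.

```python
import numpy as np
import pandas as pd

# Toy pdays values: 999 is the "never contacted before" sentinel.
pdays = pd.Series([999, 6, 999, 3, 999])

# Swap the sentinel for -1; real values are left untouched, and the
# placeholder no longer dominates summary statistics.
cleaned = pd.Series(np.where(pdays == 999, -1, pdays))

print(pdays.mean())    # 601.2, inflated by the sentinel
print(cleaned.mean())  # 1.2, on the true scale of pdays
```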
Analysing categorical columns¶
cat_cols = df.select_dtypes(['object']).columns
for col in cat_cols:
print("Column name:",col)
print("Unique categories:",df[col].unique())
print("Number of unique categories:",df[col].nunique())
print()
Column name: job Unique categories: ['services' 'admin.' 'blue-collar' 'technician' 'housemaid' 'retired' 'management' 'unknown' 'entrepreneur' 'unemployed' 'student' 'self-employed'] Number of unique categories: 12 Column name: marital Unique categories: ['married' 'single' 'divorced' 'unknown'] Number of unique categories: 4 Column name: education Unique categories: ['high.school' 'basic.9y' 'professional.course' 'unknown' 'basic.4y' 'basic.6y' 'university.degree' 'illiterate'] Number of unique categories: 8 Column name: default Unique categories: ['no' 'unknown' 'yes'] Number of unique categories: 3 Column name: housing Unique categories: ['no' 'yes' 'unknown'] Number of unique categories: 3 Column name: loan Unique categories: ['yes' 'no' 'unknown'] Number of unique categories: 3 Column name: region_code Unique categories: ['3' '4' 'Na' '1' '2'] Number of unique categories: 5 Column name: state_code Unique categories: ['S1' 'S2' 'S3' 'S4' 'S5' 'S6' 'S7' 'S8' 'S9' 'S10' 'S11' 'S12' 'S13' 'S14' 'S15' 'S16' 'S17' 'S18' 'S19' 'S20' 'S21' 'S22' 'S23' 'S24' 'S25' 'S26' 'S27' 'S28' 'S29' 'S30' 'S31' 'S32' 'S33' 'S34' 'S35' 'S36' 'S37' 'S38' 'S39' 'S40' 'S41' 'S42' 'S43' 'S44' 'S45' 'S46' 'S47' 'S48' 'S49'] Number of unique categories: 49 Column name: city_code Unique categories: ['C1' 'C2' 'C3' 'C4' 'C5' 'C6' 'C7' 'C8' 'C9' 'C10' 'C11' 'C12' 'C13' 'C14' 'C15' 'C16' 'C17' 'C18' 'C19' 'C20' 'C21' 'C22' 'C23' 'C24' 'C25' 'C26' 'C27' 'C28' 'C29' 'C30' 'C31' 'C32' 'C33' 'C34' 'C35' 'C36' 'C37' 'C38' 'C39' 'C40' 'C41' 'C42' 'C43' 'C44' 'C45' 'C46' 'C47' 'C48' 'C49' 'C50' 'C51' 'C52' 'C53' 'C54' 'C55' 'C56' 'C57' 'C58' 'C59' 'C60' 'C61' 'C62' 'C63' 'C64' 'C65' 'C66' 'C67' 'C68' 'C69' 'C70' 'C71' 'C72' 'C73' 'C74' 'C75' 'C76' 'C77' 'C78' 'C79' 'C80' 'C81' 'C82' 'C83' 'C84' 'C85' 'C86' 'C87' 'C88' 'C89' 'C90' 'C91' 'C92' 'C93' 'C94' 'C95' 'C96' 'C97' 'C98' 'C99' 'C100' 'C101' 'C102' 'C103' 'C104' 'C105' 'C106' 'C107' 'C108' 'C109' 'C110' 'C111' 'C112' 'C113' 'C114' 'C115' 
'C116' 'C117' 'C118' 'C119' 'C120' 'C121' 'C122' 'C123' 'C124' 'C125' 'C126' 'C127' 'C128' 'C129' 'C130' 'C131' 'C132' 'C133' 'C134' 'C135' 'C136' 'C137' 'C138' 'C139' 'C140' 'C141' 'C142' 'C143' 'C144' 'C145' 'C146' 'C147' 'C148' 'C149' 'C150' 'C151' 'C152' 'C153' 'C154' 'C155' 'C156' 'C157' 'C158' 'C159' 'C160' 'C161' 'C162' 'C163' 'C164' 'C165' 'C166' 'C167' 'C168' 'C169' 'C170' 'C171' 'C172' 'C173' 'C174' 'C175' 'C176' 'C177' 'C178' 'C179' 'C180' 'C181' 'C182' 'C183' 'C184' 'C185' 'C186' 'C187' 'C188' 'C189' 'C190' 'C191' 'C192' 'C193' 'C194' 'C195' 'C196' 'C197' 'C198' 'C199' 'C200' 'C201' 'C202' 'C203' 'C204' 'C205' 'C206' 'C207' 'C208' 'C209' 'C210' 'C211' 'C212' 'C213' 'C214' 'C215' 'C216' 'C217' 'C218' 'C219' 'C220' 'C221' 'C222' 'C223' 'C224' 'C225' 'C226' 'C227' 'C228' 'C229' 'C230' 'C231' 'C232' 'C233' 'C234' 'C235' 'C236' 'C237' 'C238' 'C239' 'C240' 'C241' 'C242' 'C243' 'C244' 'C245' 'C246' 'C247' 'C248' 'C249' 'C250' 'C251' 'C252' 'C253' 'C254' 'C255' 'C256' 'C257' 'C258' 'C259' 'C260' 'C261' 'C262' 'C263' 'C264' 'C265' 'C266' 'C267' 'C268' 'C269' 'C270' 'C271' 'C272' 'C273' 'C274' 'C275' 'C276' 'C277' 'C278' 'C279' 'C280' 'C281' 'C282' 'C283' 'C284' 'C285' 'C286' 'C287' 'C288' 'C289' 'C290' 'C291' 'C292' 'C293' 'C294' 'C295' 'C296' 'C297' 'C298' 'C299' 'C300' 'C301' 'C302' 'C303' 'C304' 'C305' 'C306' 'C307' 'C308' 'C309' 'C310' 'C311' 'C312' 'C313' 'C314' 'C315' 'C316' 'C317' 'C318' 'C319' 'C320' 'C321' 'C322' 'C323' 'C324' 'C325' 'C326' 'C327' 'C328' 'C329' 'C330' 'C331' 'C332' 'C333' 'C334' 'C335' 'C336' 'C337' 'C338' 'C339' 'C340' 'C341' 'C342' 'C343' 'C344' 'C345' 'C346' 'C347' 'C348' 'C349' 'C350' 'C351' 'C352' 'C353' 'C354' 'C355' 'C356' 'C357' 'C358' 'C359' 'C360' 'C361' 'C362' 'C363' 'C364' 'C365' 'C366' 'C367' 'C368' 'C369' 'C370' 'C371' 'C372' 'C373' 'C374' 'C375' 'C376' 'C377' 'C378' 'C379' 'C380' 'C381' 'C382' 'C383' 'C384' 'C385' 'C386' 'C387' 'C388' 'C389' 'C390' 'C391' 'C392' 'C393' 'C394' 'C395' 'C396' 'C397' 'C398' 'C399' 'C400' 
'C401' 'C402' 'C403' 'C404' 'C405' 'C406' 'C407' 'C408' 'C409' 'C410' 'C411' 'C412' 'C413' 'C414' 'C415' 'C416' 'C417' 'C418' 'C419' 'C420' 'C421' 'C422' 'C423' 'C424' 'C425' 'C426' 'C427' 'C428' 'C429' 'C430' 'C431' 'C432' 'C433' 'C434' 'C435' 'C436' 'C437' 'C438' 'C439' 'C440' 'C441' 'C442' 'C443' 'C444' 'C445' 'C446' 'C447' 'C448' 'C449' 'C450' 'C451' 'C452' 'C453' 'C454' 'C455' 'C456' 'C457' 'C458' 'C459' 'C460' 'C461' 'C462' 'C463' 'C464' 'C465' 'C466' 'C467' 'C468' 'C469' 'C470' 'C471' 'C472' 'C473' 'C474' 'C475' 'C476' 'C477' 'C478' 'C479' 'C480' 'C481' 'C482' 'C483' 'C484' 'C485' 'C486' 'C487' 'C488' 'C489' 'C490' 'C491' 'C492' 'C493' 'C494' 'C495' 'C496' 'C497' 'C498' 'C499' 'C500' 'C501' 'C502' 'C503' 'C504' 'C505' 'C506' 'C507' 'C508' 'C509' 'C510' 'C511' 'C512' 'C513' 'C514' 'C515' 'C516' 'C517' 'C518' 'C519' 'C520' 'C521' 'C522' 'C523' 'C524' 'C525' 'C526' 'C527' 'C528' 'C529' 'C530' 'C531'] Number of unique categories: 531 Column name: contact Unique categories: ['telephone' 'cellular'] Number of unique categories: 2 Column name: month Unique categories: ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep'] Number of unique categories: 10 Column name: day_of_week Unique categories: ['mon' 'tue' 'wed' 'thu' 'fri'] Number of unique categories: 5 Column name: poutcome Unique categories: ['nonexistent' 'failure' 'success'] Number of unique categories: 3 Column name: postal_code Unique categories: [42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105 98115 73034 90045 19134 88220 78207 77036 62521 71203 6824 75051 92374 45011 7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227 38401 33614 95051 55044 92037 77506 94513 27514 7960 45231 
94110 90301 33319 80906 7109 48180 8701 22204 80004 7601 33710 19143 90805 92345 37130 84041 78745 1852 31907 6040 78550 85705 62301 2038 33024 98198 61604 89115 2886 33180 28403 92646 40475 80027 1841 39212 48187 10801 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591 92691 48307 7060 85635 98661 60505 76017 40214 75081 44105 75701 27217 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052 7036 93905 8901 17602 3301 21044 75043 6360 22304 43615 87401 92503 90503 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376 61107 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030 46350 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503 93727 27834 11561 35630 31204 52402 2908 81001 94533 32725 42071 6457 11520 90660 84604 84062 30080 24153 44134 36608 2740 75061 8360 85301 14304 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677 8302 2149 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224 60076 75023 23434 46203 7002 28314 27405 21215 53142 66062 98002 74133 97756 27604 74403 6450 42104 46614 6010 89015 99207 76248 45014 32127 97504 22901 59801 33178 29501 97477 32712 19601 80020 65807 7501 73120 23320 79424 65203 37604 36830 92404 1453 59715 85345 44107 8861 91761 91730 56560 75150 95207 32174 94086 3820 17403 77840 63116 2169 95336 44240 76903 84106 35810 37918 72209 48146 43302 80122 5408 4401 38671 47362 48640 57103 80525 47905 37042 95823 91360 2148 1040 87105 89431 92236 60126 7055 29406 23602 14701 46544 43402 92253 32303 37211 98226 60098 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107 53186 93309 33068 45373 78415 90278 32839 7050 70601 60035 11550 46060 55124 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369 
95695 77489 77581 94403 49505 93277 66212 92592 92399 2151 77301 60477 52001 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762 53214 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712 2895 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704 2920 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776 33063 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321 6708 30605 4240 61832 85323 30062 85364 54401 99301 60302 32503 77573 20877 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423 18018 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020 48104 91941 83201 49423 6460 60089 92630 96003 95928 13501 72032 82001 42301 83605 70065 3060 38134 94061 37087 93454 60016 98632 37075 50701 2138 60067 1915 97405 93030 98059 60025 33445 80022 77590 27893 87124 27534 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123 91505 95351 67846 8401 80501 95616 26003 95037 7011 53081 30344 57701 1810 34741 6484 6810 52302 32771 78666 80634 76063 44035 83301 33134 60441 1752 20707 77803 71603 57401 21740 7017 60004 60543 55433 92672 94568 93405 72762 95240 77571 45040 30188] Number of unique categories: 631 Column name: y Unique categories: ['no' 'yes'] Number of unique categories: 2
Observations
- The job, marital, education, default, housing and loan columns have an 'unknown' category which needs further analysis.
- region_code has 'Na' as one of its categories, which needs to be explored. One way is to map these 'Na' values to their corresponding regions using the corresponding state_code.
- poutcome has 'nonexistent' as one of its categories, which needs further analysis.
- All the other categorical columns seem to have the correct number of categories.
df['job'] = df['job'].replace('admin.','admin')
Filling any 'Na' in region_code with corresponding regions¶
def get_state_region(state):
    # Look up the Region_Code for a given State_Code in the state master
    region = state_master_df.loc[state_master_df.State_Code == state]['Region_Code'].values[0]
    return region
df.region_code = df.state_code.map(get_state_region)
df.region_code = df.region_code.astype('object')
df
| | customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | services | married | high.school | no | no | yes | 3 | S1 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 1 | 2 | 45 | services | married | basic.9y | unknown | no | no | 3 | S1 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 2 | 3 | 59 | admin | married | professional.course | no | no | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90036 | no |
| 3 | 4 | 41 | blue-collar | married | unknown | unknown | no | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| 4 | 5 | 24 | technician | single | professional.course | no | yes | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37079 | 37080 | 73 | retired | married | professional.course | no | yes | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10009 | yes |
| 37080 | 37081 | 46 | blue-collar | married | professional.course | no | no | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10011 | no |
| 37081 | 37082 | 56 | retired | married | university.degree | no | yes | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10009 | no |
| 37082 | 37083 | 44 | technician | married | professional.course | no | no | no | 4 | S17 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 85254 | yes |
| 37083 | 37084 | 74 | retired | married | professional.course | no | yes | no | 1 | S6 | ... | -1 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 79109 | no |
37084 rows × 26 columns
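The row-wise lookup above works, but it scans state_master_df once per row. An equivalent vectorized sketch, with toy frames standing in for the real state master and customer data:

```python
import pandas as pd

# Toy lookup mirroring state_master_df (State_Code -> Region_Code).
state_master = pd.DataFrame({
    'State_Code': ['S1', 'S2', 'S3'],
    'Region_Code': [3, 4, 3],
})

# Toy customer slice with one region encoded as 'Na'.
customers = pd.DataFrame({
    'state_code': ['S1', 'S2', 'S3'],
    'region_code': ['3', 'Na', '3'],
})

# Build the lookup Series once, then map the whole column in one pass
# instead of filtering the master table per row.
region_lookup = state_master.set_index('State_Code')['Region_Code']
customers['region_code'] = customers['state_code'].map(region_lookup).astype('object')
print(customers['region_code'].tolist())
```

On 37084 rows the difference is noticeable, since `Series.map` with a lookup Series is a single hash join rather than 37084 boolean filters.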
Exploring unknown values¶
df.loc[df.job == 'unknown']
| | customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 24 | 55 | unknown | married | university.degree | unknown | unknown | unknown | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 28 | 29 | 55 | unknown | married | basic.4y | unknown | yes | no | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 66 | 67 | 57 | unknown | married | unknown | unknown | no | no | 1 | S11 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 60610 | no |
| 82 | 83 | 57 | unknown | married | unknown | unknown | yes | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 94122 | no |
| 134 | 135 | 38 | unknown | divorced | high.school | unknown | yes | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 95661 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36353 | 36354 | 59 | unknown | married | unknown | no | no | no | 2 | S16 | ... | 6 | 1 | success | -1.7 | 94.027 | -38.3 | 0.900 | 4991.6 | 10009 | yes |
| 36403 | 36404 | 64 | unknown | married | unknown | no | yes | no | 1 | S27 | ... | -1 | 0 | nonexistent | -1.7 | 94.027 | -38.3 | 0.905 | 4991.6 | 74133 | yes |
| 36608 | 36609 | 67 | unknown | divorced | unknown | unknown | yes | no | 2 | S16 | ... | 6 | 2 | success | -1.1 | 94.199 | -37.5 | 0.880 | 4963.6 | 10024 | yes |
| 36920 | 36921 | 63 | unknown | married | professional.course | no | no | no | 4 | S5 | ... | 6 | 1 | success | -1.1 | 94.601 | -49.5 | 1.025 | 4963.6 | 98105 | no |
| 37008 | 37009 | 64 | unknown | married | unknown | no | no | no | 2 | S16 | ... | 3 | 3 | success | -1.1 | 94.767 | -50.8 | 1.048 | 4963.6 | 10024 | no |
306 rows × 26 columns
306/37084 *100
0.8251537051019309
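The same filter-then-divide check is repeated below for each categorical column; it can also be summarized in a single expression. A sketch on toy data:

```python
import pandas as pd

# Toy categorical slice; 'unknown' stands in for a missing value.
toy = pd.DataFrame({
    'job':     ['services', 'unknown', 'admin', 'unknown'],
    'marital': ['married', 'single', 'unknown', 'married'],
})

# Percentage of 'unknown' entries per column, computed in one pass:
# the boolean frame's column means are the per-column shares.
unknown_pct = (toy == 'unknown').mean() * 100
print(unknown_pct)
```

Applied to the full frame (e.g. `(df[cat_cols] == 'unknown').mean() * 100`), this gives every column's unknown share at once.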
df.loc[df.marital == 'unknown']
| | customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 33 | 34 | 58 | management | unknown | university.degree | no | yes | no | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 346 | 347 | 59 | retired | unknown | university.degree | unknown | no | no | 2 | S32 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1852 | no |
| 367 | 368 | 57 | retired | unknown | basic.4y | no | no | no | 2 | S30 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 6040 | no |
| 1334 | 1335 | 33 | services | unknown | high.school | no | yes | yes | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.855 | 5191.0 | 90049 | no |
| 1435 | 1436 | 32 | unknown | unknown | university.degree | no | no | no | 2 | S25 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.855 | 5191.0 | 44134 | yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34699 | 34700 | 31 | entrepreneur | unknown | university.degree | no | no | no | 2 | S40 | ... | -1 | 0 | nonexistent | -3.4 | 92.431 | -26.9 | 0.722 | 5017.5 | 21215 | yes |
| 34726 | 34727 | 31 | entrepreneur | unknown | university.degree | no | yes | no | 1 | S12 | ... | -1 | 1 | failure | -3.4 | 92.431 | -26.9 | 0.722 | 5017.5 | 55901 | no |
| 35393 | 35394 | 30 | technician | unknown | university.degree | no | no | no | 4 | S28 | ... | 11 | 2 | success | -1.8 | 93.369 | -34.8 | 0.639 | 5008.7 | 88101 | yes |
| 35394 | 35395 | 30 | technician | unknown | university.degree | no | no | no | 4 | S28 | ... | 6 | 2 | success | -1.8 | 93.369 | -34.8 | 0.639 | 5008.7 | 88101 | yes |
| 36791 | 36792 | 30 | technician | unknown | university.degree | no | no | no | 2 | S10 | ... | 16 | 1 | success | -1.1 | 94.601 | -49.5 | 0.942 | 4963.6 | 19140 | no |
72 rows × 26 columns
72/37084*100
0.19415381296516018
df.loc[df.education == 'unknown']
| | customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 4 | 41 | blue-collar | married | unknown | unknown | no | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| 6 | 7 | 41 | blue-collar | married | unknown | unknown | no | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90032 | no |
| 21 | 22 | 59 | technician | married | unknown | no | yes | no | 1 | S9 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 68025 | no |
| 24 | 25 | 46 | admin | married | unknown | no | no | no | 4 | S8 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 84057 | no |
| 25 | 26 | 49 | blue-collar | married | unknown | no | no | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90049 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37016 | 37017 | 34 | technician | married | unknown | no | yes | no | 2 | S16 | ... | -1 | 2 | failure | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | 10011 | no |
| 37018 | 37019 | 60 | admin | married | unknown | no | no | no | 3 | S3 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | 33801 | no |
| 37020 | 37021 | 34 | technician | married | unknown | no | no | no | 3 | S33 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | 30080 | yes |
| 37033 | 37034 | 54 | technician | married | unknown | no | yes | no | 4 | S2 | ... | -1 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.041 | 4963.6 | 92804 | no |
| 37073 | 37074 | 34 | student | single | unknown | no | yes | no | 3 | S1 | ... | -1 | 2 | failure | -1.1 | 94.767 | -50.8 | 1.031 | 4963.6 | 42420 | no |
1549 rows × 26 columns
1549/37084*100
4.177003559486571
df.loc[df.default == 'unknown']
| | customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 45 | services | married | basic.9y | unknown | no | no | 3 | S1 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 3 | 4 | 41 | blue-collar | married | unknown | unknown | no | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| 6 | 7 | 41 | blue-collar | married | unknown | unknown | no | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90032 | no |
| 11 | 12 | 54 | retired | married | basic.9y | unknown | yes | yes | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90032 | no |
| 13 | 14 | 46 | blue-collar | married | basic.6y | unknown | yes | yes | 4 | S5 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 98103 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36817 | 36818 | 70 | retired | married | basic.4y | unknown | no | no | 4 | S2 | ... | 14 | 3 | failure | -1.1 | 94.601 | -49.5 | 0.965 | 4963.6 | 91360 | no |
| 36865 | 36866 | 49 | admin | married | high.school | unknown | no | no | 4 | S5 | ... | 6 | 3 | success | -1.1 | 94.601 | -49.5 | 0.985 | 4963.6 | 98105 | yes |
| 36866 | 36867 | 66 | retired | married | basic.4y | unknown | unknown | unknown | 1 | S12 | ... | 6 | 3 | success | -1.1 | 94.601 | -49.5 | 0.985 | 4963.6 | 55407 | yes |
| 36891 | 36892 | 84 | retired | divorced | basic.4y | unknown | yes | no | 2 | S25 | ... | -1 | 0 | nonexistent | -1.1 | 94.601 | -49.5 | 0.993 | 4963.6 | 43229 | no |
| 36904 | 36905 | 84 | retired | divorced | basic.4y | unknown | yes | yes | 3 | S20 | ... | 3 | 1 | success | -1.1 | 94.601 | -49.5 | 1.000 | 4963.6 | 36830 | no |
7700 rows × 26 columns
7700/37084*100
20.76367166432963
df.loc[df.housing == 'unknown']
| customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 24 | 55 | unknown | married | university.degree | unknown | unknown | unknown | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 73 | 74 | 51 | blue-collar | married | basic.4y | unknown | unknown | unknown | 3 | S19 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 38109 | no |
| 236 | 237 | 27 | blue-collar | single | basic.6y | no | unknown | unknown | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 95051 | no |
| 341 | 342 | 59 | management | married | basic.4y | unknown | unknown | unknown | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 356 | 357 | 48 | services | married | high.school | unknown | unknown | unknown | 2 | S16 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 10035 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36866 | 36867 | 66 | retired | married | basic.4y | unknown | unknown | unknown | 1 | S12 | ... | 6 | 3 | success | -1.1 | 94.601 | -49.5 | 0.985 | 4963.6 | 55407 | yes |
| 36907 | 36908 | 31 | admin | single | university.degree | no | unknown | unknown | 2 | S16 | ... | 3 | 1 | success | -1.1 | 94.601 | -49.5 | 1.008 | 4963.6 | 10035 | yes |
| 36929 | 36930 | 34 | admin | divorced | university.degree | no | unknown | unknown | 1 | S11 | ... | -1 | 1 | failure | -1.1 | 94.601 | -49.5 | 1.025 | 4963.6 | 60505 | no |
| 36941 | 36942 | 46 | management | married | university.degree | no | unknown | unknown | 4 | S2 | ... | 12 | 2 | failure | -1.1 | 94.601 | -49.5 | 1.032 | 4963.6 | 94110 | no |
| 37013 | 37014 | 42 | services | divorced | university.degree | no | unknown | unknown | 2 | S16 | ... | -1 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | 10011 | yes |
882 rows × 26 columns
882/37084*100
2.378384208823212
df.loc[df.loan == 'unknown']
| customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 24 | 55 | unknown | married | university.degree | unknown | unknown | unknown | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 73 | 74 | 51 | blue-collar | married | basic.4y | unknown | unknown | unknown | 3 | S19 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 38109 | no |
| 236 | 237 | 27 | blue-collar | single | basic.6y | no | unknown | unknown | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 95051 | no |
| 341 | 342 | 59 | management | married | basic.4y | unknown | unknown | unknown | 2 | S10 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 19140 | no |
| 356 | 357 | 48 | services | married | high.school | unknown | unknown | unknown | 2 | S16 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 10035 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36866 | 36867 | 66 | retired | married | basic.4y | unknown | unknown | unknown | 1 | S12 | ... | 6 | 3 | success | -1.1 | 94.601 | -49.5 | 0.985 | 4963.6 | 55407 | yes |
| 36907 | 36908 | 31 | admin | single | university.degree | no | unknown | unknown | 2 | S16 | ... | 3 | 1 | success | -1.1 | 94.601 | -49.5 | 1.008 | 4963.6 | 10035 | yes |
| 36929 | 36930 | 34 | admin | divorced | university.degree | no | unknown | unknown | 1 | S11 | ... | -1 | 1 | failure | -1.1 | 94.601 | -49.5 | 1.025 | 4963.6 | 60505 | no |
| 36941 | 36942 | 46 | management | married | university.degree | no | unknown | unknown | 4 | S2 | ... | 12 | 2 | failure | -1.1 | 94.601 | -49.5 | 1.032 | 4963.6 | 94110 | no |
| 37013 | 37014 | 42 | services | divorced | university.degree | no | unknown | unknown | 2 | S16 | ... | -1 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.046 | 4963.6 | 10011 | yes |
882 rows × 26 columns
882/37084*100
2.378384208823212
df.poutcome.value_counts()
nonexistent 32023 failure 3822 success 1239 Name: poutcome, dtype: int64
Action Insights
- Since the percentage of unknown values in the job, marital, education, housing and loan columns is small (<5% of the total number of data points), we can safely impute these unknowns with the mode of each column.
- The default column has approximately 20% of its data points as unknowns. So, rather than imputing them with the mode, we will treat them as a separate category at the time of model building.
- The poutcome column also has a nonexistent category, which is the largest in number. This is plausible, since customers who weren't contacted in the previous campaign will not have any success or failure outcome corresponding to them. However, we need to check that the value of the previous column corresponding to these nonexistent values is 0.
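The per-column unknown percentages quoted above can be computed in one pass rather than filtering column by column; a minimal sketch, with a toy frame standing in for `df` (values are illustrative, not the bank data):

```python
import pandas as pd

# toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({'job':     ['admin', 'unknown', 'services', 'admin'],
                    'default': ['no', 'unknown', 'unknown', 'no']})

# share of 'unknown' entries per column, as a percentage
unknown_pct = (toy == 'unknown').mean() * 100
print(unknown_pct.to_dict())  # {'job': 25.0, 'default': 50.0}
```

The same expression run on the full `df` reproduces the 4.18% / 20.76% / 2.38% figures computed manually above.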
Checking for the previous column values corresponding to poutcome = nonexistent¶
df.loc[df.poutcome=='nonexistent']['previous'].sum()
0
Missing Value Treatment¶
Imputing Unknowns with Mode in columns except for Default¶
cols_to_impute = ['job', 'marital', 'education', 'housing', 'loan']
df[cols_to_impute] = df[cols_to_impute].replace('unknown', np.nan)
for col in cols_to_impute:
    df[col] = df[col].fillna(df[col].mode()[0])
df.shape
(37084, 26)
Analysis of Percentage Turnout of Marketing Campaign¶
percent_turnout = df['y'].value_counts(normalize=True)*100
percent_turnout = pd.DataFrame(percent_turnout)
percent_turnout.columns = ['response']
percent_turnout
| response | |
|---|---|
| no | 88.652788 |
| yes | 11.347212 |
percent_turnout.plot(kind='bar')
Right Mode to contact the customers (telephone or mobile)¶
contact_percent_df = pd.crosstab(df.loc[df.y=='yes']['y'],df.contact,normalize=True)*100
contact_percent_df.index.name = 'response'
contact_percent_df
| contact | cellular | telephone |
|---|---|---|
| response | ||
| yes | 82.984791 | 17.015209 |
contact_percent_df.plot(kind='bar')
Attempts made to turn a person to a successful depositor¶
attempts_made_df = pd.crosstab(df.loc[df.y=='yes']['y'],df.campaign,normalize=True)*100
attempts_made_df.index.name = 'response'
attempts_made_df
| campaign | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 17 | 23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| response | |||||||||||||||||
| yes | 49.6673 | 25.998099 | 12.238593 | 5.584601 | 2.614068 | 1.639734 | 0.784221 | 0.3327 | 0.308935 | 0.237643 | 0.261407 | 0.071293 | 0.071293 | 0.023764 | 0.047529 | 0.095057 | 0.023764 |
attempts_made_df.plot(kind='bar')
attempts_made_df_prev = pd.crosstab(df.loc[df.y=='yes']['y'],df.previous,normalize=True)*100
attempts_made_df_prev.index.name = 'response'
attempts_made_df_prev
| previous | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| response | |||||||
| yes | 67.609316 | 20.912548 | 7.604563 | 2.685361 | 0.831749 | 0.285171 | 0.071293 |
attempts_made_df_prev.plot(kind='bar')
Observations
- It seems that the percentage turnout of the marketing campaign is roughly 11%, which indicates that the campaign was not very successful in converting the majority of the target customers towards term deposits.
- Amongst the customers who actually agreed to place a deposit, roughly 83% were contacted by mobile. Hence mobile/cellular is the preferred mode of contacting the customers. This might also be because landline telephone usage has diminished with advancements in technology, so most customers might not possess one.
- Amongst the customers who actually agreed to place a deposit, almost 88% were contacted 3 times or fewer for the current campaign scheme. Hence, the majority of the customers who were convinced for the scheme required 3 or fewer attempts.
- Amongst the customers who were convinced during the previous campaign, 95% were contacted 2 times or fewer. Hence, increasing the total attempts beyond 3 will not necessarily drive more customers towards the scheme.
- Detailed analysis of all the hypothesis will be done in Univariate and Bivariate Analysis through various plots and statistical tests.
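The "3 or fewer attempts" claim above can be checked with a cumulative sum over the crosstab percentages; a sketch using illustrative rounded shares (not the exact notebook output):

```python
import pandas as pd

# illustrative per-attempt conversion shares (percent),
# rounded from the attempts_made_df crosstab above
pct = pd.Series({1: 49.7, 2: 26.0, 3: 12.2, 4: 5.6, 5: 2.6})

cum = pct.cumsum()       # share of converts reached within <= n attempts
print(round(cum[3], 1))  # 87.9
```

In the notebook, `attempts_made_df.T.cumsum()` gives the same cumulative view directly from the crosstab.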
Step 2: Exploratory Data Analysis and Hypothesis Testing¶
Univariate Analysis¶
1. Numerical Features¶
df.dtypes
customer_id int64 age int64 job object marital object education object default object housing object loan object region_code object state_code object city_code object contact object month object day_of_week object duration int64 campaign int64 pdays int64 previous int64 poutcome object emp_var_rate float64 cons_price_idx float64 cons_conf_idx float64 euribor3m float64 nr_employed float64 postal_code object y object dtype: object
numerical_df = df.select_dtypes(['int64','float64'])
numerical_df.dtypes
customer_id int64 age int64 duration int64 campaign int64 pdays int64 previous int64 emp_var_rate float64 cons_price_idx float64 cons_conf_idx float64 euribor3m float64 nr_employed float64 dtype: object
numerical_df.describe()
| customer_id | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 |
| mean | 18542.500000 | 40.042714 | 258.237946 | 2.569545 | -0.741721 | 0.172986 | 0.082669 | 93.576076 | -40.505183 | 3.621668 | 5167.058664 |
| std | 10705.373028 | 10.432965 | 258.730909 | 2.770611 | 1.515288 | 0.495681 | 1.568997 | 0.578493 | 4.622045 | 1.733972 | 72.196605 |
| min | 1.000000 | 17.000000 | 0.000000 | 1.000000 | -1.000000 | 0.000000 | -3.400000 | 92.201000 | -50.800000 | 0.634000 | 4963.600000 |
| 25% | 9271.750000 | 32.000000 | 102.000000 | 1.000000 | -1.000000 | 0.000000 | -1.800000 | 93.075000 | -42.700000 | 1.344000 | 5099.100000 |
| 50% | 18542.500000 | 38.000000 | 180.000000 | 2.000000 | -1.000000 | 0.000000 | 1.100000 | 93.749000 | -41.800000 | 4.857000 | 5191.000000 |
| 75% | 27813.250000 | 47.000000 | 319.250000 | 3.000000 | -1.000000 | 0.000000 | 1.400000 | 93.994000 | -36.400000 | 4.961000 | 5228.100000 |
| max | 37084.000000 | 98.000000 | 4918.000000 | 56.000000 | 27.000000 | 7.000000 | 1.400000 | 94.767000 | -26.900000 | 5.045000 | 5228.100000 |
group1 = ['age','duration','pdays']
group2 = ['campaign','previous']
group3 = ['euribor3m','nr_employed','emp_var_rate']
group4 = ['cons_price_idx','cons_conf_idx']
def UVA_KDE(data, var_group):
size = len(var_group)
plt.figure(figsize = (7*size,3), dpi = 100)
for i,var in enumerate(var_group):
mini = data[var].min()
maxi = data[var].max()
ran = data[var].max()-data[var].min()
mean = data[var].mean()
median = data[var].median()
st_dev = data[var].std()
skew = data[var].skew()
kurt = data[var].kurtosis()
points = mean-st_dev, mean+st_dev
plt.subplot(1,size,i+1)
sns.kdeplot(data[var], shade=True)
sns.lineplot(points, [0,0], color = 'black', label = "std_dev")
sns.scatterplot([mini,maxi], [0,0], color = 'orange', label = "min/max")
sns.scatterplot([mean], [0], color = 'red', label = "mean")
sns.scatterplot([median], [0], color = 'blue', label = "median")
plt.xlabel('{}'.format(var), fontsize = 20)
plt.ylabel('density')
plt.title('kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(round(kurt,2),round(skew,2),
round(ran,2),
round(mean,2),
round(median,2)))
UVA_KDE(numerical_df,group1)
UVA_KDE(numerical_df,group2)
UVA_KDE(numerical_df,group3)
UVA_KDE(numerical_df,group4)
Observations
- GROUP1
- age is a continuous numerical variable with a mean of 40. But its distribution is slightly right-skewed, indicating the presence of some large values (>65) which can be outliers and need to be analysed further using boxplots and treated accordingly.
- duration is also a continuous numerical variable with a mean of 258. Its distribution is also right-skewed, indicating the presence of some large outliers which need to be analysed further using boxplots and treated.
- pdays is a discrete numeric column, and the kde plot indicates that most of the customers were contacted for the first time during the current campaign.
- GROUP2
- campaign is a discrete numeric column with a median of 2 which indicates that 50% of the customers were contacted 2 times or less during the current campaign. However, this column still has some outlier values which can be analysed using boxplots and treated accordingly.
- previous column is also discrete numerical and the distribution indicates that most of the customers were not contacted at all previously.
- GROUP3
- euribor3m, which is a daily indicator, seems to have a bimodal distribution with the majority of customers having a value around 1 or 5.
- nr_employed, which is a quarterly indicator, has a multimodal distribution with the majority of customers having a value around 5100 or 5200 (higher values), indicating a high proportion of the workforce involved in jobs and services.
- emp_var_rate seems to have a multimodal distribution with the majority of customers having a value around 1.5 or -2, indicating that we have a high proportion of customers both from those currently employed or actively seeking employment and from those who are unemployed or retired.
- GROUP4
- consumer price index has a multimodal distribution with the majority of the index values towards the higher end of the spectrum, indicating a surge over time in the prices of availing bank services.
- consumer confidence index also has a multimodal distribution, with the majority of the index values being large negatives, indicating that the customers are generally pessimistic about their financial situation. This might be one of the reasons why these customers show interest in fixed deposit schemes as a safe mode of investment.
from scipy import stats
def UVA_boxplot(data, var_group):
size = len(var_group)
plt.figure(figsize = (7*size,4), dpi = 100)
from scipy import stats
for i,var in enumerate(var_group):
quant25 = data[var].quantile(0.25)
quant75 = data[var].quantile(0.75)
IQR = quant75 - quant25
med = data[var].median()
whis_low = quant25-(1.5*IQR)
whis_high = quant75+(1.5*IQR)
skew = data[var].skew()
mini = data[var].min()
maxi = data[var].max()
kurt = data[var].kurtosis()
shapiro_test = stats.shapiro(data[var])
outlier_high = len(data[var][data[var]>whis_high])
outlier_low = len(data[var][data[var]<whis_low])
plt.subplot(1,size,i+1)
sns.boxplot(data[var], orient="v")
plt.ylabel('{}'.format(var))
plt.title('Skew = {}; Range = {} \n 1st, 2nd, 3rd quartile = {};\n Whiskers = {}; kurtosis = {} ; \n Number of Outlier (low/high) = {} '.format(
round(skew,2),
(round(mini,2),round(maxi,2)),
(round(quant25,2),round(med,2),round(quant75,2)),
(round(whis_low,2), round(whis_high,2)),
(round(kurt,2)),
(outlier_low,outlier_high)
))
UVA_boxplot(numerical_df,group1)
UVA_boxplot(numerical_df,group2)
UVA_boxplot(numerical_df,group3)
UVA_boxplot(numerical_df,group4)
Observations
- age column seems to have some outliers, i.e. values > 70, which is a bit odd. These values can be clipped at 70.
- duration and campaign columns also have certain outliers. However, these columns should first be checked for correlation. If the correlation is high, they do not need any outlier treatment; if it is very low, the outliers will be clipped to their highest whisker values.
- pdays and previous columns have outliers, but these don't need any treatment, as they only indicate the number of times customers were contacted previously.
- cons_conf_idx also has a small number of outliers. But the difference between the highest whisker value and the outlier values is very small, so they will be left unchanged.
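The age clipping suggested above is a one-liner with `Series.clip`; a minimal sketch on toy values:

```python
import pandas as pd

ages = pd.Series([25, 40, 72, 98])   # toy values; 72 and 98 play the outliers
print(ages.clip(upper=70).tolist())  # [25, 40, 70, 70]
```

On the real data this would be `df['age'] = df['age'].clip(upper=70)`.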
2. Categorical Features¶
categorical_df = df.select_dtypes(['object'])
categorical_df.dtypes
job object marital object education object default object housing object loan object region_code object state_code object city_code object contact object month object day_of_week object poutcome object postal_code object y object dtype: object
categorical_df.describe()
| job | marital | education | default | housing | loan | region_code | state_code | city_code | contact | month | day_of_week | poutcome | postal_code | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 | 37084 |
| unique | 11 | 3 | 7 | 3 | 2 | 2 | 4 | 49 | 531 | 2 | 10 | 5 | 3 | 631 | 2 |
| top | admin | married | university.degree | no | yes | no | 4 | S2 | C21 | cellular | may | thu | nonexistent | 10035 | no |
| freq | 9726 | 22551 | 12520 | 29382 | 20315 | 31443 | 11914 | 7427 | 3422 | 23522 | 12420 | 7778 | 32023 | 974 | 32876 |
def UVA_category(data, var_group):
size = len(var_group)
plt.figure(figsize = (7*size,7), dpi = 100)
for i,var in enumerate(var_group):
norm_count = round(data[var].value_counts(normalize = True)*100,2)
n_uni = data[var].nunique()
sns.set_style('darkgrid')
plt.subplot(1,size,i+1)
sns.barplot(norm_count, norm_count.index , order = norm_count.index)
plt.xlabel('fraction/percent', fontsize = 20)
plt.ylabel('{}'.format(var), fontsize = 20)
plt.title('n_uniques = {} \n value counts \n {};'.format(n_uni,norm_count))
group1 = ['job','marital','education']
group2 = ['default','housing','loan']
group3 = ['contact','month','day_of_week']
group4 = ['poutcome','y']
UVA_category(categorical_df,group1)
UVA_category(categorical_df,group2)
UVA_category(categorical_df,group3)
UVA_category(categorical_df,group4)
Observations¶
- Almost 75% of the customers are involved in either administrative, blue-collar, services or technician job.
- Almost 88% of the customers are married or single.
- Almost 82% of the customers have a university degree, high school degree, basic_9y education or professional course completion. Hence majority of the customers are well educated.
- Almost 80% of the customers have not defaulted in the credits in past which is a good sign for the bank.
- Housing loans are taken by approximately 50% of the customers.
- 80% of the customers don't have any personal loans.
- Almost 65% of the customers have cellular as their category of contact for the current campaign.
- Almost 80% of the campaigns seem to have been run in the months of may, june, july and august.
- Day of the week seems to be equally distributed across all the working days, indicating that customers are being reached equally on all working days.
- As 85% of the customers haven't been contacted previously, they don't have any details corresponding to the outcome of the previous campaign. This might be because these customers are newly engaged with the bank, so the bank must also focus on attractive schemes in the future in order to retain them.
- The campaign doesn't seem to be successful, as the majority of the customers haven't invested in the term deposit policies. The bank can either increase the interest rate for this policy or try to approach these customers through other media channels like digital platforms, personalized advertisements, SMS, emails etc. in future.
Bivariate Analysis¶
1. Numerical-Numerical¶
numerical_df.dtypes
customer_id int64 age int64 duration int64 campaign int64 pdays int64 previous int64 emp_var_rate float64 cons_price_idx float64 cons_conf_idx float64 euribor3m float64 nr_employed float64 dtype: object
numerical_df = numerical_df.drop('customer_id',axis=1)
rounded_corr = round(numerical_df.corr(method='pearson'),2)
plt.figure(figsize=(8,8))
ax = sns.heatmap(rounded_corr,cmap = 'BuPu', annot=True)
plt.show()
Observations
- pdays and previous seem to have some positive correlation between them
- previous and emp_var_rate seem to have a moderate negative correlation
- cons_price_idx, euribor3m and nr_employed seem to have a very strong positive correlation with emp_var_rate
- euribor3m and nr_employed seem to have some positive correlation with cons_price_idx
- previous has some negative correlation with euribor3m, while nr_employed has a high positive correlation with euribor3m
- previous has some negative correlation with nr_employed
- campaign and duration don't seem to have any significant correlation between them. This contradicts our supposition in the univariate analysis that high campaign values would be related to high duration. So the outliers in these columns will be clipped to their respective highest whisker values.
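Clipping to the highest whisker value, as proposed above for campaign and duration, follows Tukey's 1.5×IQR rule; a sketch on toy data:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 4, 20])          # toy data with one high outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # q1 = 2.0, q3 = 3.5
upper_whisker = q3 + 1.5 * (q3 - q1)           # 5.75
clipped = s.clip(upper=upper_whisker)
print(upper_whisker, clipped.max())  # 5.75 5.75
```

Applying the same computation per column caps each outlier at its column's upper whisker without dropping any rows.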
sns.scatterplot(numerical_df['pdays'],numerical_df['previous'])
<AxesSubplot:xlabel='pdays', ylabel='previous'>
sns.scatterplot(numerical_df['previous'],numerical_df['emp_var_rate'])
<AxesSubplot:xlabel='previous', ylabel='emp_var_rate'>
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['emp_var_rate'])
<AxesSubplot:xlabel='cons_price_idx', ylabel='emp_var_rate'>
sns.scatterplot(numerical_df['euribor3m'],numerical_df['emp_var_rate'])
<AxesSubplot:xlabel='euribor3m', ylabel='emp_var_rate'>
sns.scatterplot(numerical_df['nr_employed'],numerical_df['emp_var_rate'])
<AxesSubplot:xlabel='nr_employed', ylabel='emp_var_rate'>
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['euribor3m'])
<AxesSubplot:xlabel='cons_price_idx', ylabel='euribor3m'>
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['nr_employed'])
<AxesSubplot:xlabel='cons_price_idx', ylabel='nr_employed'>
sns.scatterplot(numerical_df['previous'],numerical_df['euribor3m'])
<AxesSubplot:xlabel='previous', ylabel='euribor3m'>
sns.scatterplot(numerical_df['previous'],numerical_df['nr_employed'])
<AxesSubplot:xlabel='previous', ylabel='nr_employed'>
sns.scatterplot(numerical_df['euribor3m'],numerical_df['nr_employed'])
<AxesSubplot:xlabel='euribor3m', ylabel='nr_employed'>
Observations¶
- Although the correlation heatmap gave several high correlation values, the scatter plots don't show clear linear relationships, since most of the numeric features take only a few discrete values. Hence, we won't be deleting any of the highly correlated features for our modelling purposes.
2. Categorical-Categorical¶
LIST OF HYPOTHESIS TO CHECK¶
Chi-square results
- Does job level have a significant impact on the term deposit subscription?
- Does marital status have a significant impact on the term deposit subscription?
- Does education level have a significant impact on the term deposit subscription?
- Does default have a significant impact on the term deposit subscription?
- Do housing loans have a significant impact on the term deposit subscription?
- Do personal loans have a significant impact on the term deposit subscription?
- Does region_code have a significant impact on the term deposit subscription?
- Does contact type have a significant impact on the term deposit subscription?
- Does month of the year have a significant impact on the term deposit subscription?
- Does day_of_week have a significant impact on the term deposit subscription?
- Does previous campaign outcome have a significant impact on the term deposit subscription?
pd.options.plotting.backend = 'matplotlib'
from scipy.stats import chi2_contingency
def BVA_categorical_plot(data, tar, cat):
data = data[[cat,tar]][:]
table = pd.crosstab(data[tar],data[cat],)
f_obs = np.array([table.iloc[0][:].values,
table.iloc[1][:].values])
chi, p, dof, expected = chi2_contingency(f_obs)
if p<0.05:
sig = True
else:
sig = False
plt.figure(figsize=(12,6))
sns.countplot(x=cat, hue=tar, data=data)
plt.title("p-value = {}\n difference significant? = {}\n Chi-square = {}".format(round(p,5),sig,round(chi,2)))
dd = data.groupby(cat)[tar].value_counts(normalize=True).unstack()*100
dd.plot(kind='bar', stacked=True, title=str(dd), figsize=(12,6))
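For reference, the chi-square machinery inside the helper can be exercised standalone on a toy contingency table (counts are made up, not the bank data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# toy 2x2 table: rows = response (no / yes), columns = contact type
table = np.array([[500, 450],
                  [ 30, 120]])
chi, p, dof, expected = chi2_contingency(table)
print(dof)       # 1 degree of freedom for a 2x2 table
print(p < 0.05)  # True: the imbalance in the toy counts is significant
```

`expected` holds the counts we would see if response and contact type were independent; the test measures how far the observed table deviates from it.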
BVA_categorical_plot(categorical_df,'y','job')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (job type does not have any impact on the term deposit subscription). Hence, different job types have different term deposit subscription rates
- Students and retired people have the largest proportion of subscribers to our policy
BVA_categorical_plot(categorical_df,'y','marital')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (marital status does not have any impact on the term deposit subscription). Hence, married, single and divorced people have different term deposit subscription rates
- Single people seem to have more term deposits than the other marital statuses
BVA_categorical_plot(categorical_df,'y','education')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (education level does not have any impact on the term deposit subscription). Hence, different education levels have different numbers of term deposit subscribers
- People with a university degree seem to subscribe to our policy the most
BVA_categorical_plot(categorical_df,'y','default')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (default history does not have any impact on the term deposit subscription). Hence, whether the person has defaulted in the past affects the term deposit subscription
- Those who do not default seem to have the most term deposits
BVA_categorical_plot(categorical_df,'y','housing')
Observations
- As the p-value is 0.03, which is <0.05, we reject the null hypothesis (having a housing loan does not have any impact on the term deposit subscription). Hence, whether the person has a housing loan affects the term deposit subscription
- Those who have housing loans seem to have the most term deposits
BVA_categorical_plot(categorical_df,'y','loan')
Observations
- As the p-value is 0.32, which is >0.05, we fail to reject the null hypothesis (having a personal loan does not have any impact on the term deposit subscription). Hence, whether the person has a personal loan does not affect the term deposit subscription
BVA_categorical_plot(categorical_df,'y','region_code')
Observations
- As the p-value is 0.21, which is >0.05, we fail to reject the null hypothesis (the region a person lives in does not have any impact on the term deposit subscription). Hence, whether the person lives in region 1, 2, 3 or 4 does not affect the term deposit subscription
BVA_categorical_plot(categorical_df,'y','contact')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (the mode of contact, telephone or mobile, does not have any impact on the term deposit subscription)
- People contacted via mobile phones tend to have a higher subscription rate
BVA_categorical_plot(categorical_df,'y','month')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (month of campaign does not have any impact on the term deposit subscription). Hence, different months have different subscription rates
- The months of march, sept, oct and dec tend to have a higher subscription rate
BVA_categorical_plot(categorical_df,'y','day_of_week')
Observations
- As the p-value is 2e-5, which is <0.05, we reject the null hypothesis (day of the week does not have any impact on the term deposit subscription). Hence, different days have different subscription rates
- Thursdays tend to have the highest subscription rate
BVA_categorical_plot(categorical_df,'y','poutcome')
Observations
- As the p-value is 0, which is <0.05, we reject the null hypothesis (outcome of the previous campaign does not have any impact on the term deposit subscription). Hence, previous campaigns do affect our subscription rates
- People who subscribed in the previous campaign tend to have the highest subscription rate in the current campaign as well
3. Categorical-Numerical¶
LIST OF HYPOTHESIS TO CHECK¶
- Is the average age of subscribed customers equal to that of unsubscribed customers?
- Is the average duration of contact of subscribed customers equal to that of unsubscribed customers?
- Is the average number of current campaign contacts of subscribed customers equal to that of unsubscribed customers?
- Is the average number of days of contact after the previous campaign of subscribed customers equal to that of unsubscribed customers?
- Is the average number of previous campaign contacts of subscribed customers equal to that of unsubscribed customers?
- Is the average emp_var_rate of subscribed customers equal to that of unsubscribed customers?
- Is the average cons_price_idx of subscribed customers equal to that of unsubscribed customers?
- Is the average cons_conf_idx of subscribed customers equal to that of unsubscribed customers?
- Is the average euribor3m of subscribed customers equal to that of unsubscribed customers?
- Is the average nr_employed of subscribed customers equal to that of unsubscribed customers?
from numpy import sqrt, abs, round
from scipy.stats import norm
from scipy.stats import t as t_dist
def TwoSampZ(X1, X2, sigma1, sigma2, N1, N2):
ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2)
z = (X1 - X2)/ovr_sigma
pval = 2*(1 - norm.cdf(abs(z)))
return pval
# Testing whether the means are equal. If p_value < 0.05, we reject the null hypothesis.
def TwoSampT(X1, X2, sd1, sd2, n1, n2):
ovr_sd = sqrt(sd1**2/n1 + sd2**2/n2)
t = (X1 - X2)/ovr_sd
df = n1+n2-2
pval = 2*(1 - t_dist.cdf(abs(t),df))
return pval
# Testing whether the means are equal. If p_value < 0.05, we reject the null hypothesis.
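As a standalone sanity check of the two-sample z formula above, re-implemented with only the standard library so it runs without scipy (`norm_cdf` and `two_samp_z` are helper names introduced here, mirroring TwoSampZ):

```python
from math import sqrt, erf

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def two_samp_z(x1, x2, s1, s2, n1, n2):
    # same formula as TwoSampZ above
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    z = (x1 - x2) / se
    return 2 * (1 - norm_cdf(abs(z)))

print(two_samp_z(10.0, 10.0, 2.0, 2.0, 100, 100))         # 1.0: identical means
print(two_samp_z(10.0, 12.0, 2.0, 2.0, 100, 100) < 0.05)  # True: clear separation
```

Identical sample means give z = 0 and hence p = 1, while a two-unit gap with these standard errors yields z ≈ 7 and a vanishingly small p-value.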
def Bivariate_cont_cat(data, cont, cat, category):
    x1 = data[cont][data[cat]==category]
    x2 = data[cont][~(data[cat]==category)]
    n1, n2 = x1.shape[0], x2.shape[0]
    m1, m2 = x1.mean(), x2.mean()
    std1, std2 = x1.std(), x2.std()  # fixed: std2 was mistakenly x2.mean()
    t_p_val = round(TwoSampT(m1, m2, std1, std2, n1, n2), 4)
    z_p_val = round(TwoSampZ(m1, m2, std1, std2, n1, n2), 4)
    table = pd.pivot_table(data=data, values=cont, columns=cat, aggfunc='mean')
    table = round(table, 2)
    plt.figure(figsize=(15,6), dpi=140)
    plt.subplot(1,2,1)
    sns.barplot(x=[str(category), 'not {}'.format(category)], y=[m1, m2])
    plt.ylabel('mean {}'.format(cont))
    plt.xlabel(cat)
    plt.title('t-test p-value = {} \n z-test p-value = {}\n {}'.format(t_p_val, z_p_val, table))
    plt.subplot(1,2,2)
    sns.boxplot(x=cat, y=cont, data=data)
    plt.title('categorical boxplot')
Bivariate_cont_cat(df, 'age', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0.0007 < 0.05, we reject the null hypothesis: the mean age of customers who subscribed is statistically different from that of customers who did not
Bivariate_cont_cat(df, 'duration', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the mean duration of contact is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'campaign', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the mean number of contacts in the current campaign is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'pdays', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the average number of days between the previous campaign's contact and the current one is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'previous', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the mean number of contacts in the previous campaign is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'emp_var_rate', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the average emp_var_rate is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'cons_price_idx', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0.6303 > 0.05, we fail to reject the null hypothesis: the average cons_price_idx of subscribers and non-subscribers is statistically similar
Bivariate_cont_cat(df, 'cons_conf_idx', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0.0011 < 0.05, we reject the null hypothesis: the average cons_conf_idx is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'euribor3m', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0 < 0.05, we reject the null hypothesis: the average euribor3m index is statistically different between subscribers and non-subscribers
Bivariate_cont_cat(df, 'nr_employed', 'y', 'yes')
Observations
- Since the p-value for both the t-test and z-test is 0.0045 < 0.05, we reject the null hypothesis: the average nr_employed is statistically different between subscribers and non-subscribers
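Rather than reading ten separate plots, the same conclusions can be collected into one table. A minimal sketch (using SciPy's Welch t-test as a stand-in for the notebook's helper functions, and a toy frame in place of df):

```python
# Sketch: one summary table of categorical-vs-numeric test results.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def cont_cat_pvalues(data, target, positive, cont_cols):
    """Two-sided p-value of each numeric column split by the target class."""
    rows = []
    for col in cont_cols:
        x1 = data.loc[data[target] == positive, col]
        x2 = data.loc[data[target] != positive, col]
        # equal_var=False -> Welch's t-test, robust to unequal group variances
        _, p = ttest_ind(x1, x2, equal_var=False)
        rows.append({'feature': col, 'p_value': round(p, 4),
                     'significant_at_5pct': p < 0.05})
    return pd.DataFrame(rows)

# Toy data standing in for the notebook's df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': np.r_[rng.normal(40, 5, 200), rng.normal(45, 5, 200)],  # shifted by class
    'duration': rng.normal(250, 40, 400),                          # same distribution
    'y': ['no'] * 200 + ['yes'] * 200,
})
print(cont_cat_pvalues(toy, 'y', 'yes', ['age', 'duration']))
```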
Step 3: Outlier Treatment and Data Transformation¶
df
| customer_id | age | job | marital | education | default | housing | loan | region_code | state_code | ... | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | postal_code | y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | services | married | high.school | no | no | yes | 3 | S1 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 1 | 2 | 45 | services | married | basic.9y | unknown | no | no | 3 | S1 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 42420 | no |
| 2 | 3 | 59 | admin | married | professional.course | no | no | no | 4 | S2 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 90036 | no |
| 3 | 4 | 41 | blue-collar | married | university.degree | unknown | no | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| 4 | 5 | 24 | technician | single | professional.course | no | yes | no | 3 | S3 | ... | -1 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 33311 | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37079 | 37080 | 73 | retired | married | professional.course | no | yes | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10009 | yes |
| 37080 | 37081 | 46 | blue-collar | married | professional.course | no | no | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10011 | no |
| 37081 | 37082 | 56 | retired | married | university.degree | no | yes | no | 2 | S16 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 10009 | no |
| 37082 | 37083 | 44 | technician | married | professional.course | no | no | no | 4 | S17 | ... | -1 | 0 | nonexistent | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 85254 | yes |
| 37083 | 37084 | 74 | retired | married | professional.course | no | yes | no | 1 | S6 | ... | -1 | 1 | failure | -1.1 | 94.767 | -50.8 | 1.028 | 4963.6 | 79109 | no |
37084 rows × 26 columns
Outlier Treatment for Age, Duration and Campaign Columns (as per our insights from EDA)
# Defining remove_outlier, which returns the lower and upper IQR fences
def remove_outlier(col):
    # np.percentile does not require a pre-sorted input
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
# Cap (winsorize) at the IQR fences instead of dropping rows, so the sample
# size is preserved; .loc avoids pandas chained-assignment pitfalls
lr, ur = remove_outlier(df['duration'])
df.loc[df['duration'] < lr, 'duration'] = lr
df.loc[df['duration'] > ur, 'duration'] = ur
lr, ur = remove_outlier(df['campaign'])
df.loc[df['campaign'] < lr, 'campaign'] = lr
df.loc[df['campaign'] > ur, 'campaign'] = ur
lr, ur = remove_outlier(df['age'])
df.loc[df['age'] < lr, 'age'] = lr
df.loc[df['age'] > ur, 'age'] = ur
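The per-column capping above can also be expressed with `Series.clip`, which applies both fences in one call. A small self-contained sketch on toy data:

```python
# Sketch: IQR capping (winsorizing) with Series.clip.
import numpy as np
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap a series at the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(s, [25, 75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 100])  # 100 is an outlier
print(cap_iqr(s).tolist())        # the outlier is capped at Q3 + 1.5*IQR = 7
```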
Dropping Postal Code to prevent Dimensionality Explosion and Overfitting
df.drop('postal_code', axis=1, inplace=True)
df.shape
(37084, 25)
One Hot Encoding of Object Columns (as they are nominal in nature)
df =pd.get_dummies(df, columns=['job',
'marital',
'education',
'default',
'housing',
'loan',
'region_code',
'state_code',
'contact',
'month',
'day_of_week',
'city_code',
'poutcome',
'y'],drop_first=True)
df.shape
(37084, 631)
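As a quick illustration of what drop_first=True does (on a toy frame, not the notebook's data): each categorical column with k levels becomes k-1 indicator columns, with the alphabetically first level serving as the implicit baseline, which avoids perfect multicollinearity among the dummies.

```python
import pandas as pd

toy = pd.DataFrame({'poutcome': ['failure', 'success', 'nonexistent']})
encoded = pd.get_dummies(toy, columns=['poutcome'], drop_first=True)
# 'failure' (alphabetically first) is dropped and acts as the baseline
print(list(encoded.columns))  # -> ['poutcome_nonexistent', 'poutcome_success']
```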
Data Transformation/Normalization of Numeric Columns using z-score
from scipy.stats import zscore
df[['age','duration','campaign','pdays','previous','emp_var_rate','cons_price_idx', 'cons_conf_idx','nr_employed','euribor3m']] = df[['age','duration','campaign','pdays','previous','emp_var_rate','cons_price_idx', 'cons_conf_idx','nr_employed','euribor3m']].apply(zscore)
df.describe()
| customer_id | age | duration | campaign | pdays | previous | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | ... | city_code_C93 | city_code_C94 | city_code_C95 | city_code_C96 | city_code_C97 | city_code_C98 | city_code_C99 | poutcome_nonexistent | poutcome_success | y_yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37084.000000 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | 3.708400e+04 | ... | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 | 37084.000000 |
| mean | 18542.500000 | -1.231293e-16 | -1.073840e-15 | 2.788444e-15 | 1.661555e-15 | 1.230906e-13 | 1.447244e-13 | 5.848812e-14 | 9.814922e-14 | -8.836288e-14 | ... | 0.000539 | 0.003317 | 0.002103 | 0.001510 | 0.001915 | 0.002346 | 0.001160 | 0.863526 | 0.033411 | 0.113472 |
| std | 10705.373028 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | 1.000013e+00 | ... | 0.023217 | 0.057497 | 0.045814 | 0.038831 | 0.043715 | 0.048380 | 0.034033 | 0.343296 | 0.179709 | 0.317173 |
| min | 1.000000 | -2.265290e+00 | -1.322932e+00 | -8.221931e-01 | -1.704507e-01 | -3.489902e-01 | -2.219709e+00 | -2.377031e+00 | -2.227359e+00 | -1.723043e+00 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 9271.750000 | -7.851027e-01 | -7.489512e-01 | -8.221931e-01 | -1.704507e-01 | -3.489902e-01 | -1.199935e+00 | -8.661874e-01 | -4.748648e-01 | -1.313573e+00 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 50% | 18542.500000 | -1.930276e-01 | -3.100249e-01 | -1.779476e-01 | -1.704507e-01 | -3.489902e-01 | 6.484045e-01 | 2.989255e-01 | -2.801432e-01 | 7.124386e-01 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 27813.250000 | 6.950849e-01 | 4.735709e-01 | 4.662980e-01 | -1.704507e-01 | -3.489902e-01 | 8.396120e-01 | 7.224457e-01 | 8.881865e-01 | 7.724174e-01 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| max | 37084.000000 | 2.915366e+00 | 2.307354e+00 | 2.399035e+00 | 1.830813e+01 | 1.377317e+01 | 8.396120e-01 | 2.058695e+00 | 2.943581e+00 | 8.208617e-01 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 631 columns
Step 4: Model Building & Performance Analysis¶
Separating Independent and Dependent Variables¶
X = df.drop(['customer_id','y_yes'],axis=1)
y = df[['y_yes']]
Train-Test Split (test_size = 0.2)¶
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=42,stratify=y)
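stratify=y ensures the roughly 89:11 class split is preserved in both halves, which matters for a target this imbalanced. A minimal check on a toy 90:10 label vector:

```python
# Sketch: stratified splitting keeps the class ratio in train and test.
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 90 + [1] * 10)   # 90:10 imbalance, like the notebook's target
X_toy = np.arange(100).reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
print(y_tr.mean(), y_te.mean())          # both halves keep 10% positives
```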
Model 1: Grid Search on Random Forest Classifier¶
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Base estimator to tune
rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 75, 100, 150],
    'max_depth': [8, 10, 12, 15],
    'min_samples_leaf': [10, 20, 30],
}
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3)
grid_search_rf.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=RandomForestClassifier(max_depth=15),
param_grid={'max_depth': [8, 10, 12, 15],
'min_samples_leaf': [10, 20, 30],
'n_estimators': [50, 75, 100, 150]})
print(grid_search_rf.best_params_)
{'max_depth': 15, 'min_samples_leaf': 10, 'n_estimators': 50}
from sklearn import metrics
y_train_predict = grid_search_rf.predict(X_train)
model_score =grid_search_rf.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
0.902956146560151
[[26163 138]
[ 2741 625]]
precision recall f1-score support
0 0.91 0.99 0.95 26301
1 0.82 0.19 0.30 3366
accuracy 0.90 29667
macro avg 0.86 0.59 0.63 29667
weighted avg 0.90 0.90 0.87 29667
pred_prob = grid_search_rf.predict_proba(X_test)
grid_probs = pred_prob[:,1]
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,grid_probs)
f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.5753768844221105 0.16508158723302946
y_test_predict = (grid_probs >= threshold[ix]).astype('int')
# Note: .score() reports accuracy at the default 0.5 cutoff, not at the
# tuned threshold used for the confusion matrix below
model_score = grid_search_rf.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
0.9040043144128354
[[5716 859]
[ 155 687]]
precision recall f1-score support
0 0.97 0.87 0.92 6575
1 0.44 0.82 0.58 842
accuracy 0.86 7417
macro avg 0.71 0.84 0.75 7417
weighted avg 0.91 0.86 0.88 7417
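Note that the 0.904 printed above comes from `.score()`, which classifies at the default 0.5 cutoff, while the confusion matrix uses the tuned F1-optimal threshold, so the two accuracies differ. The accuracy consistent with the matrix can be computed from the thresholded predictions directly; a sketch with toy arrays standing in for y_test and grid_probs:

```python
# Sketch: accuracy at the tuned cutoff rather than the default 0.5.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 1, 1])                # stand-in for y_test
probs = np.array([0.05, 0.20, 0.40, 0.30, 0.90])  # stand-in for grid_probs
threshold = 0.165                                 # the tuned F1-optimal cutoff
y_pred = (probs >= threshold).astype(int)
print(accuracy_score(y_true, y_pred))             # accuracy at the tuned cutoff
```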
Model 2: Random Forest Classifier¶
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
RandomForestClassifier()
from sklearn import metrics
rf_train_predict = rf.predict(X_train)
model_score =rf.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, rf_train_predict))
print(metrics.classification_report(y_train, rf_train_predict))
1.0
[[26301 0]
[ 0 3366]]
precision recall f1-score support
0 1.00 1.00 1.00 26301
1 1.00 1.00 1.00 3366
accuracy 1.00 29667
macro avg 1.00 1.00 1.00 29667
weighted avg 1.00 1.00 1.00 29667
pred_prob = rf.predict_proba(X_test)
rf_probs = pred_prob[:,1]
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,rf_probs)
f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6184782608695651 0.33
rf_test_predict= (rf_probs>=threshold[ix]).astype('int')
model_score = rf.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, rf_test_predict))
print(metrics.classification_report(y_test, rf_test_predict))
0.9111500606714305
[[6146 429]
[ 273 569]]
precision recall f1-score support
0 0.96 0.93 0.95 6575
1 0.57 0.68 0.62 842
accuracy 0.91 7417
macro avg 0.76 0.81 0.78 7417
weighted avg 0.91 0.91 0.91 7417
importances = rf.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices))[-10:], importances[indices][-10:], color='b', align='center')
plt.yticks(range(len(indices))[-10:], X_train.columns[indices][-10:])
plt.xlabel('Relative Importance')
plt.show()
Model 3: Extreme Gradient Boosting Machine¶
# !pip install xgboost
import xgboost as xgb
xg_reg = xgb.XGBClassifier()
xg_reg.fit(X_train,y_train)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100,
n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1, ...)
from sklearn import metrics
xg_train_predict = xg_reg.predict(X_train)
model_score =xg_reg.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, xg_train_predict))
print(metrics.classification_report(y_train, xg_train_predict))
0.949303940405164
[[25827 474]
[ 1030 2336]]
precision recall f1-score support
0 0.96 0.98 0.97 26301
1 0.83 0.69 0.76 3366
accuracy 0.95 29667
macro avg 0.90 0.84 0.86 29667
weighted avg 0.95 0.95 0.95 29667
pred_prob = xg_reg.predict_proba(X_test)
xg_probs = pred_prob[:,1]
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,xg_probs)
f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6584992343032159 0.31435093
xg_test_predict= (xg_probs>=threshold[ix]).astype('int')
model_score = xg_reg.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, xg_test_predict))
print(metrics.classification_report(y_test, xg_test_predict))
0.916003775111231
[[6103 472]
[ 197 645]]
precision recall f1-score support
0 0.97 0.93 0.95 6575
1 0.58 0.77 0.66 842
accuracy 0.91 7417
macro avg 0.77 0.85 0.80 7417
weighted avg 0.92 0.91 0.92 7417
importances = xg_reg.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices))[-10:], importances[indices][-10:], color='b', align='center')
plt.yticks(range(len(indices))[-10:], X_train.columns[indices][-10:])
plt.xlabel('Relative Importance')
plt.show()
Ensembling the 3 models for better generalization and performance improvement¶
en_probs = grid_probs*0.2 + rf_probs*0.2 + xg_probs*0.6
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,en_probs)
f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6554702495201535 0.2538157994347361
en_test_predict= (en_probs>=threshold[ix]).astype('int')
print(metrics.confusion_matrix(y_test, en_test_predict))
print(metrics.classification_report(y_test, en_test_predict))
[[6016 559]
[ 159 683]]
precision recall f1-score support
0 0.97 0.91 0.94 6575
1 0.55 0.81 0.66 842
accuracy 0.90 7417
macro avg 0.76 0.86 0.80 7417
weighted avg 0.93 0.90 0.91 7417
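To compare the candidate models side by side, the class-1 metrics can be tabulated per model. A sketch on toy predictions (in the notebook, y_test_predict, rf_test_predict, xg_test_predict and en_test_predict would be used in place of the toy arrays):

```python
# Sketch: one table of class-1 metrics per model.
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1]                   # toy ground truth
preds = {                                        # toy per-model predictions
    'grid_rf':  [0, 1, 1, 1, 0, 0, 1],
    'rf':       [0, 0, 1, 0, 1, 0, 1],
    'ensemble': [0, 0, 1, 1, 1, 0, 1],
}
summary = pd.DataFrame({
    name: {'precision': precision_score(y_true, p),
           'recall': recall_score(y_true, p),
           'f1': f1_score(y_true, p)}
    for name, p in preds.items()
}).T.round(2)
print(summary)
```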
Step 5: Model Insights and Recommendations¶
Model Insights¶
- Three models have been built and analysed for the current scenario.
- Grid Search CV on RandomForestClassifier
- RandomForestClassifier
- XGBoostClassifier
- SMOTE is not used: the dataset has roughly a 90:10 class imbalance, which is severe enough that the synthetic minority samples SMOTE generates would be of poor quality.
- The purpose of using these three models is as follows: GridSearchCV performs hyperparameter tuning of the random forest classifier with 3-fold cross-validation, which helps the model generalize to unseen data. Next, a plain RandomForest model is trained (its hyperparameters were tuned via grid search, not shown here). Finally, an XGBoost classifier is used with default settings, since boosting often yields better results for classification.
- From the feature importances of the random forest classifier, we observe that duration, the socio-economic indices, age, pdays, and previous-campaign success are the most important features and receive maximum importance while modelling.
- From the feature importances of the XGBoost classifier, we observe that the socio-economic indices, duration, previous-campaign success, and the telephone contact mode are the most important features and receive maximum importance while modelling.
- Here, recall matters more than precision: false negatives are costlier than false positives, because we want the model to identify most of the customers who would actually subscribe, even at the cost of misclassifying some non-subscribers as positives. Hence recall is the primary evaluation metric, while choosing the F1-optimal threshold ensures precision is not sacrificed too heavily.
- The predictions of the three models are then ensembled with appropriate weights to combine the strengths of all three. The final model has a recall of 0.81, an F1-score of 0.66, and a precision of 0.55 for class 1 (see the classification report above).
Recommendations and Suggestions for Improvement¶
- The marketing team should limit its calling effort to at most 3 contacts per customer, as the majority (80%) of term-deposit subscribers signed up within 3 or fewer contacts.
- The cellular mode of contact has proven more effective than the telephone (landline) mode, since most customers prefer mobiles over landlines for communication.
- As shown in the EDA, campaigns could be targeted more towards students and retired individuals, as these groups show a higher proportion of subscribers than the other job types. Single people also subscribe to the term-deposit schemes more than married or divorced individuals, as they are still in their settling phase.
- We should also target customers who have not defaulted in the past, as they appear more interested in subscribing to our schemes; this also reduces the risk of default among our subscribers.
- Most campaigns should be rolled out towards the end of the year, around September, October and December, as these months have shown a high proportion of subscribers.
- The EDA also shows that the success of the previous campaign strongly affects the conversion rate of the current campaign. Hence, the team should invest in making every campaign a success, as it carries some adstock into future campaigns.
- Campaigns should also focus on selling the schemes to the literate population rather than the illiterate one.